Amendments to the Claims: 

This listing of claims replaces all prior versions and listings of claims in the application: 

Listing of Claims: 
1-54. (Canceled) 

55. (Previously presented) A computer-implemented method for identifying compounds in text, 
comprising: 

extracting a vocabulary of tokens from text; 

iterating from n > 2 down to n = 2 where n decreases by one each iteration and in each 
iteration performing the actions of: 

identifying a plurality of unique n-grams in the text, each n-gram being an 
occurrence in the text of n sequential tokens, each token being found in the vocabulary; 

dividing each n-gram into n-1 pairs of two adjacent segments, where each 
segment consists of at least one token; 

for each n-gram, calculating a likelihood of collocation for each pair of segments 
of the n-gram and determining a score for the n-gram based on a lowest calculated likelihood of 
collocation; 

identifying a set of n-grams having scores above a threshold; and 
adding the identified set of n-grams as compound tokens to the vocabulary and 
removing constituent tokens that occur in the added compound tokens from the vocabulary. 

56. (Previously presented) The method of claim 55 where calculating a likelihood of collocation 
for each pair of segments of the n-gram comprises determining a likelihood ratio $\lambda$ for each pair 
of segments that is computed in accordance with the formula: 



where $L(H_i)$ is a likelihood of observing $H_i$ under an independence hypothesis, $L(H_c)$ is a 
likelihood of observing $H_c$ under a collocation hypothesis, and $H$ is a pair of segments. 

57. (Previously presented) The method of claim 56 where the $L(H_c)$ is computed for each pair 
of segments, $t_1$, $t_2$, in each n-gram in accordance with the formula: 

\[
\arg\max_{L(H_c)} \frac{L(t_1, t_2 \text{ form compound})}{L(n\text{-gram does not form compound})}.
\]

58. (Previously presented) The method of claim 56 where, for each pair of segments, $t_1$, $t_2$, in 
each n-gram, the independence hypothesis comprises $P(t_2 \mid t_1) = P(t_2 \mid \bar{t}_1)$ and the collocation 
hypothesis comprises $P(t_2 \mid t_1) > P(t_2 \mid \bar{t}_1)$. 

59. (Previously presented) The method of claim 55 where identifying a plurality of unique 
n-grams in the text comprises skipping n-grams appearing in a list of known compounds. 

60. (Currently Amended) A storage device storing program code, which, when executed by a 
processor, causes the processor [[A computer program product, encoded on a computer readable 
medium, operable to cause data processing apparatus]] to perform operations comprising: 

extracting a vocabulary of tokens from text; 

iterating from n > 2 down to n = 2 where n decreases by one each iteration and in each 
iteration performing the actions of: 

identifying a plurality of unique n-grams in the text, each n-gram being an 
occurrence in the text of n sequential tokens, each token being found in the vocabulary; 

dividing each n-gram into n-1 pairs of two adjacent segments, where each 
segment consists of at least one token; 

for each n-gram, calculating a likelihood of collocation for each pair of segments 
of the n-gram and determining a score for the n-gram based on a lowest calculated likelihood of 
collocation; 

identifying a set of n-grams having scores above a threshold; and 
adding the identified set of n-grams as compound tokens to the vocabulary and 
removing constituent tokens that occur in the added compound tokens from the vocabulary. 



61. (Currently Amended) The storage device [[program product]] of claim 60 where calculating a 
likelihood of collocation for each pair of segments of the n-gram comprises determining a 
likelihood ratio $\lambda$ for each pair of segments that is computed in accordance with the formula: 



where $L(H_i)$ is a likelihood of observing $H_i$ under an independence hypothesis, $L(H_c)$ is a 
likelihood of observing $H_c$ under a collocation hypothesis, and $H$ is a pair of segments. 

62. (Currently Amended) The storage device [[program product]] of claim 61 where the $L(H_c)$ is 
computed for each pair of segments, $t_1$, $t_2$, in each n-gram in accordance with the formula: 

\[
\arg\max_{L(H_c)} \frac{L(t_1, t_2 \text{ form compound})}{L(n\text{-gram does not form compound})}.
\]

63. (Currently Amended) The storage device [[program product]] of claim 61 where, for each pair 
of segments, $t_1$, $t_2$, in each n-gram, the independence hypothesis comprises $P(t_2 \mid t_1) = P(t_2 \mid \bar{t}_1)$ 
and the collocation hypothesis comprises $P(t_2 \mid t_1) > P(t_2 \mid \bar{t}_1)$. 

64. (Currently Amended) The storage device [[program product]] of claim 60 where identifying a 
plurality of unique n-grams in the text comprises skipping n-grams appearing in a list of known 
compounds. 

65. (Previously presented) A system comprising: 

a computer readable medium including a program product; and 
one or more processors configured to execute the program product and perform 
operations comprising: 

extracting a vocabulary of tokens from text; 

iterating from n > 2 down to n = 2 where n decreases by one each iteration and in each 
iteration performing the actions of: 



identifying a plurality of unique n-grams in the text, each n-gram being an 
occurrence in the text of n sequential tokens, each token being found in the vocabulary; 



dividing each n-gram into n-1 pairs of two adjacent segments, where each 
segment consists of at least one token; 

for each n-gram, calculating a likelihood of collocation for each pair of segments 
of the n-gram and determining a score for the n-gram based on a lowest calculated likelihood of 
collocation; 

identifying a set of n-grams having scores above a threshold; and 
adding the identified set of n-grams as compound tokens to the vocabulary and 
removing constituent tokens that occur in the added compound tokens from the vocabulary. 

66. (Previously presented) The system of claim 65 where calculating a likelihood of collocation 
for each pair of segments of the n-gram comprises determining a likelihood ratio $\lambda$ for each pair 
of segments that is computed in accordance with the formula: 

where $L(H_i)$ is a likelihood of observing $H_i$ under an independence hypothesis, $L(H_c)$ is a 
likelihood of observing $H_c$ under a collocation hypothesis, and $H$ is a pair of segments. 

67. (Previously presented) The system of claim 66 where the $L(H_c)$ is computed for each pair of 
segments, $t_1$, $t_2$, in each n-gram in accordance with the formula: 

\[
\arg\max_{L(H_c)} \frac{L(t_1, t_2 \text{ form compound})}{L(n\text{-gram does not form compound})}.
\]

68. (Previously presented) The system of claim 66 where, for each pair of segments, $t_1$, $t_2$, in 
each n-gram, the independence hypothesis comprises $P(t_2 \mid t_1) = P(t_2 \mid \bar{t}_1)$ and the collocation 
hypothesis comprises $P(t_2 \mid t_1) > P(t_2 \mid \bar{t}_1)$. 




69. (Previously presented) The system of claim 65 where identifying a plurality of unique 
n-grams in the text comprises skipping n-grams appearing in a list of known compounds. 
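
By way of illustration only, the iteration recited in claims 55, 60 and 65 might be sketched in Python as follows. The sketch is not taken from the application. The function names, the `max_n` and `threshold` parameters, and the use of the total token count as the number of trials are all hypothetical. The likelihood of collocation is approximated here by a binomial log-likelihood ratio, restricted to positive association. This is a stand-in for the formulas of claims 56 to 58, not a reproduction of them.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n sequential tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def log_binom(k, n, p):
    """Log of p^k * (1 - p)^(n - k), with p clamped away from 0 and 1."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def collocation_score(c1, c2, c12, total):
    """Assumed stand-in score: -2 log(lambda) for a pair of adjacent segments.

    c1, c2 are counts of the left and right segments, c12 is the count of
    the whole n-gram, and total is the number of tokens in the text.
    """
    if total <= c1:
        return 0.0
    p = c2 / total                     # independence: one shared probability
    p1 = c12 / c1                      # P(right | left)
    p2 = (c2 - c12) / (total - c1)     # P(right | not left)
    if p1 <= p2:
        return 0.0                     # no positive association
    log_lambda = (log_binom(c12, c1, p) + log_binom(c2 - c12, total - c1, p)
                  - log_binom(c12, c1, p1)
                  - log_binom(c2 - c12, total - c1, p2))
    return -2.0 * log_lambda


def find_compounds(tokens, max_n=3, threshold=5.0):
    """Iterate n from max_n down to 2, promoting high-scoring n-grams."""
    vocabulary = set(tokens)           # extract a vocabulary of tokens
    total = len(tokens)
    compounds = []
    for n in range(max_n, 1, -1):
        counts = Counter()
        for k in range(1, n + 1):      # counts for every segment length
            counts.update(ngrams(tokens, k))
        # unique n-grams whose tokens are all still in the vocabulary
        candidates = {g for g in ngrams(tokens, n)
                      if all(t in vocabulary for t in g)}
        selected = []
        for gram in candidates:
            # n-1 splits into two adjacent segments; keep the lowest score
            score = min(collocation_score(counts[gram[:i]], counts[gram[i:]],
                                          counts[gram], total)
                        for i in range(1, n))
            if score > threshold:
                selected.append(gram)
        for gram in sorted(selected):
            vocabulary.add(" ".join(gram))      # add the compound token
            vocabulary.difference_update(gram)  # remove its constituents
            compounds.append(gram)
    return compounds, vocabulary
```

In this sketch, removing constituent tokens from the vocabulary means that a shorter n-gram built from those tokens is no longer a candidate in later iterations. That is one possible reading of the final step of the claims.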



